2 Grammar of Graphics
Introduction to the R Programming Language
3 The Tidy Approach
3.0.1 Opinionated software
Opinionated software is a software product that believes a certain way of approaching a business process is inherently better and provides software crafted around that approach. ~ Stuart Eccles
3.0.2 Tidy data
The defining opinion of the tidyverse is its wholehearted adoption of tidy data. Tidy data has three features:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a dataframe. (This is from the paper, not the book)
Source: R for Data Science
Tidy data was formalized by Hadley Wickham in “Tidy Data” in the Journal of Statistical Software in 2014. It is equivalent to Codd’s 3rd normal form (Codd, 1990) for relational databases.
Tidy datasets are all alike, but every messy dataset is messy in its own way. ~ Hadley Wickham
The tidy approach to data science is powerful because it breaks data work into two distinct parts.
- First, get the data into a tidy format.
- Second, use tools optimized for tidy data.
By standardizing the data structure that most community-created tools expect, the tidy framework gave diffuse development a common target and reduced the friction of data work.
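As a minimal sketch of the first step (getting data into a tidy format), the snippet below reshapes a small made-up table with tidyr::pivot_longer(). The `messy` table and its column names are hypothetical and exist only for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical messy table: the variable "year" is spread across column names
messy <- tibble(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22)
)

# Tidy it so each variable is a column and each observation is a row
tidy <- pivot_longer(
  messy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

tidy
```

Once the data are in this shape, tools such as ggplot2 can use `year` and `cases` directly as variables.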
4 Grammar of Graphics
ggplot2 is an R package for data visualization that was developed during Hadley Wickham’s graduate studies at Iowa State University. Its theory is formalized in “A Layered Grammar of Graphics” by Hadley Wickham, which was published in the Journal of Computational and Graphical Statistics in 2010.
The grammar of graphics, originally by Leland Wilkinson, is a theoretical framework that breaks all data visualizations into their component pieces. With the layered grammar of graphics, Wickham extends Wilkinson’s grammar of graphics and implements it in R. The result is impressively cohesive: the theory flows into the code, which in turn informs the data visualization process in a way not matched by other data visualization tools.
There are eight main ingredients to the grammar of graphics. We will work our way through the ingredients with many hands-on examples.
4.0.1 Exercise 0
Step 1: Open 2024_asa-data-viz.Rproj.
Step 2: Open 02_workbook.Rmd.
4.0.2 Exercise 1
Step 1: Type (don’t copy & paste) the following code below library(tidyverse) in the new chunk where it says # YOUR WORK GOES HERE.
Step 2: Add a comment above the ggplot2 code that describes the plot we created.
Step 3: As we progress, add comments below the data visualization code that describe the argument or function that corresponds to each of the first three components of the grammar of graphics.
1 Data are the values represented in the visualization.
ggplot(data = ) or data %>% ggplot()
# A tibble: 19,537 × 7
name year category lat long wind pressure
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Amy 1975 NA 27.5 -79 25 1013
2 Amy 1975 NA 28.5 -79 25 1013
3 Amy 1975 NA 29.5 -79 25 1013
4 Amy 1975 NA 30.5 -79 25 1013
5 Amy 1975 NA 31.5 -78.8 25 1012
6 Amy 1975 NA 32.4 -78.7 25 1012
7 Amy 1975 NA 33.3 -78 25 1011
8 Amy 1975 NA 34 -77 30 1006
9 Amy 1975 NA 34.4 -75.8 35 1004
10 Amy 1975 NA 34 -74.8 40 1002
# ℹ 19,527 more rows
2 Aesthetic mappings are directions for how data are mapped in a plot in a way that we can perceive. Aesthetic mappings include linking variables to the x-position, y-position, color, fill, shape, transparency, and size.
aes(x = , y = , color = )
Example aesthetics: x or y position; color or fill; size; shape; others such as transparency and line type.
3 Geometric objects are representations of the data, including points, lines, and polygons.
Plots are often named after their geometric object(s).
geom_bar() or geom_col()
geom_line()
geom_point()
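The first three components fit together in a single call. The sketch below is one possible combination, assuming the storms data set from dplyr with pressure on the x-axis and wind on the y-axis; it is not necessarily the exact code used in the workbook.

```r
library(ggplot2)
library(dplyr)  # provides the storms data set

# Scatter plot of wind speed against air pressure for each storm observation
p <- ggplot(data = storms) +                          # 1. data
  geom_point(mapping = aes(x = pressure, y = wind))   # 3. geometric object, 2. aesthetic mappings

p
```

Swapping geom_point() for another geom_*() function changes the type of plot while the data and aesthetic mappings stay the same.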
4.0.3 Exercise 2
Step 1: Duplicate the code from exercise 1. Inside aes(), add color = category. Run the code.
Step 2: Replace color = category with color = "green". Run the code. What changed? Is this unexpected?
Step 3: Remove color = "green" from aes() and add it inside of geom_point() but outside of aes(). Run the code.
Step 4: This is a little cluttered. Add alpha = 0.2 inside geom_point() but outside of aes().
Aesthetic mappings like x and y almost always vary with the data. Aesthetic mappings like color, fill, shape, transparency, and size can vary with the data. But those arguments can also be added as styles that don’t vary with the data. If you include those arguments in aes(), they will show up in the legend (which can be annoying! and is also a sign that something should be changed!).
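The difference between mapping and setting can be sketched side by side. This is an illustrative example that assumes the storms data set from dplyr; the object names `mapped` and `styled` are made up for this sketch.

```r
library(ggplot2)
library(dplyr)  # provides the storms data set

# Inside aes(): "green" is treated as a data value, so ggplot2 builds a
# color scale and a legend for it, and the points are not actually green
mapped <- ggplot(data = storms) +
  geom_point(mapping = aes(x = pressure, y = wind, color = "green"))

# Outside aes(): color and alpha are fixed styles applied to every point
styled <- ggplot(data = storms) +
  geom_point(
    mapping = aes(x = pressure, y = wind),
    color = "green",
    alpha = 0.2
  )
```

Printing `mapped` and then `styled` shows the unwanted legend in the first plot and the intended green, semi-transparent points in the second.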
4.0.4 Exercise 3
Step 1: Create a new scatter plot using the msleep data set. Use bodywt on the x-axis and sleep_total on the y-axis.
Step 2: The y-axis doesn’t contain zero. Below geom_point(), add scale_y_continuous(limits = c(0, NA)). Hint: add + after geom_point().
Step 3: The x-axis is clustered near zero. Add scale_x_log10() above scale_y_continuous(limits = c(0, NA)).
Step 4: Add and run options(scipen = 999). Rerun the code from steps 1-3.
4 Scales turn data values, which are continuous, discrete, or categorical, into aesthetic values. scale_*_*() functions control the specific behaviors of aesthetic mappings. This includes not only the x-axis and y-axis, but also the ranges of sizes, types of shapes, and specific colors of aesthetics.
There are dozens of scale functions and their names follow a formula:
- They all start with scale_.
- Next comes the name of the aesthetic for the scale (e.g., x, y, fill, size).
- Finally comes the type of variable or transformation (e.g., discrete, continuous, reverse).
scale_x_continuous() and scale_y_continuous() are two popular scale_*_*() functions.
Before: scale_x_continuous()
After: scale_x_reverse()
Before: scale_size_continuous(breaks = c(25, 75, 125))
After: scale_size_continuous(range = c(0.5, 20), breaks = c(25, 75, 125))
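To see the naming formula in use, here is a small sketch with the msleep data set that ships with ggplot2. The choice of variables follows Exercise 3; the object name `p` is arbitrary.

```r
library(ggplot2)

p <- ggplot(data = msleep, mapping = aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  # scale_ + x + log10: log-transform the x-axis
  scale_x_log10() +
  # scale_ + y + continuous: start the y-axis at zero, let ggplot2 pick the upper limit
  scale_y_continuous(limits = c(0, NA))

p
```

Each scale function only affects the aesthetic named in its middle component, so the two scales above can be added in either order.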
4.0.5 Exercise 4
Step 1: Type the following code in your script.
data <- tibble(x = 1:10, y = 1:10)
ggplot(data = data) +
  geom_blank(mapping = aes(x = x, y = y))
Step 2: Add coord_polar() to your plot.
Step 3: Add labs(title = "Polar coordinate system") to your plot.
5 Coordinate systems map scaled geometric objects to the position of objects on the plane of a plot. The two most popular coordinate systems are the Cartesian coordinate system and the polar coordinate system.
coord_polar()
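One way to see what a coordinate system does is to draw the same layer under two of them. The sketch below assumes the msleep data set from ggplot2 and uses the `vore` column purely as an example; the object names are illustrative.

```r
library(ggplot2)

# A bar chart in the default Cartesian coordinate system
bars <- ggplot(data = msleep, mapping = aes(x = vore)) +
  geom_bar()

# The same data, mappings, and geom, redrawn in polar coordinates
polar <- bars +
  coord_polar() +
  labs(title = "Polar coordinate system")
```

Only the coordinate system differs between `bars` and `polar`; the counts behind the bars are identical.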
4.0.6 Exercise 5
Step 1: Create a scatter plot of the storms data set with pressure on the x-axis and wind on the y-axis.
Step 2: Add facet_wrap(~ month).
6 Facets (optional) break data into meaningful subsets. Facet functions include facet_wrap(), facet_grid(), and facet_geo() (from the geofacet package).
facet_wrap()
facet_wrap(~ category)
facet_grid()
facet_grid(month ~ year)
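The two main facet functions can be compared on a subset of the storms data. The sketch below keeps only the two most recent years so the grid stays readable; that filter and the object names are choices made for this example only.

```r
library(ggplot2)
library(dplyr)  # provides the storms data set

# Keep the two most recent years to limit the number of panels
recent <- storms %>%
  filter(year >= max(year) - 1)

base <- ggplot(data = recent, mapping = aes(x = pressure, y = wind)) +
  geom_point()

# facet_wrap(): one panel per month, wrapped into rows and columns
p_wrap <- base + facet_wrap(~ month)

# facet_grid(): rows are years and columns are months
p_grid <- base + facet_grid(year ~ month)
```

facet_wrap() suits a single grouping variable, while facet_grid() lays out the combinations of two variables.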
4.0.7 Exercise 6
Step 1: Add the following code to your script. Submit it!
ggplot(storms) +
  geom_bar(mapping = aes(x = category))
7 Statistical transformations (optional) transform the data, typically through summary statistics and functions, before aesthetic mapping.
Before transformations, each observation in the data is represented by one geometric object (e.g., a point in a scatter plot). After a transformation, a geometric object can represent more than one observation (e.g., a bar in a histogram).
Note: geom_bar() performs statistical transformation. Use geom_col() to create a column chart with bars that encode individual observations in the data set.
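The difference between geom_bar() and geom_col() can be sketched by doing the counting by hand. This example assumes the storms data set from dplyr; the `counts` table is created here for illustration.

```r
library(ggplot2)
library(dplyr)  # provides the storms data set

# geom_bar() counts the rows in each category itself (a statistical transformation)
p_bar <- ggplot(data = storms) +
  geom_bar(mapping = aes(x = category))

# The same summary computed explicitly: one row per category with its count n
counts <- storms %>%
  count(category)

# geom_col() draws the values as given, so y must be supplied
p_col <- ggplot(data = counts) +
  geom_col(mapping = aes(x = category, y = n))
```

Both plots show the same bar heights; the only difference is whether ggplot2 or dplyr performs the counting.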
4.0.8 Exercise 7
Step 1: Duplicate Exercise 6.
Step 2: Add theme_minimal() to the plot.
4.0.9 Themes
8 Theme controls the visual style of a plot, including font types, font sizes, background colors, margins, and positioning.
Example themes: default theme, minimal theme, fivethirtyeight theme, urbnthemes.
If you prefer the minimal theme, you can add theme_minimal() to each visualization or add theme_set(theme_minimal()) at the beginning of your script.
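A short sketch of setting a theme for a whole script follows. It uses the msleep data set from ggplot2 as a stand-in plot; saving and restoring the previous theme is optional and shown only as one possible habit.

```r
library(ggplot2)

# theme_set() expects a theme object, so theme_minimal() is called with parentheses.
# It invisibly returns the theme that was active before.
old_theme <- theme_set(theme_minimal())

# Any plot drawn (printed or saved) while the minimal theme is active picks it up
p <- ggplot(data = msleep, mapping = aes(x = bodywt, y = sleep_total)) +
  geom_point()

# Restore the earlier theme once the minimal theme is no longer wanted
theme_set(old_theme)
```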
4.0.10 Exercise 8 (layers!)
Step 1: Add the following code to your script. Run it!
storms %>%
  filter(category > 0) %>%
  distinct(name, year) %>%
  count(year) %>%
  ggplot() +
  geom_line(mapping = aes(x = year, y = n))
Step 2: Add geom_point(mapping = aes(x = year, y = n)) after geom_line().
Layers allow for multiple geometric objects to be plotted in the same data visualization.
4.0.11 Exercise 9
Step 1: Add the following code to your script. Run it!
ggplot(data = storms, mapping = aes(x = pressure, y = wind)) +
  geom_point() +
  geom_smooth()
Inheritance passes aesthetic mappings from ggplot() to later geom_*() functions.
Notice how the aesthetic mappings are passed to ggplot() in Exercise 9, so both geom_point() and geom_smooth() use them. This is useful when using layers!
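Layers can also add their own aesthetic mappings on top of the inherited ones. The sketch below assumes the storms data set from dplyr and uses its `status` column for color; that choice is only an example.

```r
library(ggplot2)
library(dplyr)  # provides the storms data set

p <- ggplot(data = storms, mapping = aes(x = pressure, y = wind)) +
  # Inherits x and y from ggplot() and adds color for this layer only
  geom_point(mapping = aes(color = status), alpha = 0.2) +
  # Inherits x and y, but not color, so one smoother is fit to all points
  geom_smooth()
```

Mappings given in ggplot() apply to every layer, while mappings given inside a geom_*() function apply only to that layer.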
4.0.12 Exercise 10
Step 1: Pick your favorite plot from exercises 1 through 9 and duplicate it in a new code chunk.
Step 2: Add ggsave(filename = "favorite-plot.png") and then look at the saved file.
Step 3: Add width = 6 and height = 4 to ggsave().
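A minimal sketch of saving a plot with explicit dimensions follows. It writes to a temporary directory so the example does not touch the project folder; the plot itself and the `out` path are placeholders for whichever plot you picked.

```r
library(ggplot2)

p <- ggplot(data = msleep, mapping = aes(x = bodywt, y = sleep_total)) +
  geom_point()

# Hypothetical output location for this sketch
out <- file.path(tempdir(), "favorite-plot.png")

# width and height are in inches by default
ggsave(filename = out, plot = p, width = 6, height = 4)
```

If `plot` is omitted, ggsave() saves the last plot that was displayed, which is why it can be called right after the plotting code in the exercise.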
5 Review
5.0.1 Theory
- Data
- Aesthetic mappings
- Geometric objects
- Scales
- Coordinate systems
- Facets
- Statistical transformations
- Theme
5.0.2 Functions
- ggplot()
- aes()
- geom_*()
  - geom_point()
  - geom_line()
  - geom_col()
- scale_*_*()
  - scale_y_continuous()
- coord_*()
- facet_*()
- labs()
- ggsave()